Abstract

Within the unmanageably large class of nonconvex optimization, we consider the rich sub- 
class of nonsmooth problems that have composite objectives — this already includes the exten- 
sively studied convex, composite objective problems as a special case. For this subclass, we 
introduce a powerful, new framework that permits asymptotically non-vanishing perturbations. 
In particular, we develop perturbation-based batch and incremental (online like) nonconvex 
proximal splitting algorithms. To our knowledge, this is the first time that such perturbation- 
based nonconvex splitting algorithms are being proposed and analyzed. While the main contri- 
bution of the paper is the theoretical framework, we complement our results by presenting some 
empirical results on matrix factorization. 



1 Introduction 

Within the unmanageably vast class of nonconvex optimization, we consider the rich subclass of 
problems that have nonconvex composite objectives. Specifically, we study problems of the form 



$$\text{minimize} \quad F(x) + \psi(x), \quad \text{s.t. } x \in \mathcal{X}, \tag{1}$$

where $\mathcal{X} \subset \mathbb{R}^n$ is a compact convex set, $F : \mathbb{R}^n \to \mathbb{R}$ is a differentiable function, and $\psi : \mathbb{R}^n \to \mathbb{R}$ is a lower semi-continuous (lsc) convex function. We make the common assumption that $F \in C_L^1(\mathcal{X})$, i.e., the gradient $\nabla F$ is (locally) Lipschitz continuous on $\mathcal{X}$ with constant $L$,

$$\|\nabla F(x) - \nabla F(y)\| \le L\|x - y\| \quad \text{for all } x, y \in \mathcal{X}. \tag{2}$$

Problem (1) is a natural but far-reaching generalization of composite objective convex problems that enjoy tremendous importance in machine learning; see [Duchi and Singer, 2009, Bach et al., 2011, Schmidt et al., 2009, Beck and Teboulle, 2009], for example. Although convex formulations are extremely useful, for many difficult problems a nonconvex formulation is more natural. Familiar examples include matrix factorization [Mairal et al., 2010, Lee and Seung, 2000], blind deconvolution [Kundur and Hatzinakos, 1996], dictionary learning [Kreutz-Delgado et al., 2003, Mairal et al., 2010], and neural networks [Mangasarian, 1993, Haykin, 1994].

* Reformatted, slightly enhanced version of MPI for Intelligent Systems Technical Report MPI-IS-2. 
The main contribution of our paper is a new framework NOCOPS (an acronym for nonconvex proximal splitting). This framework solves (1) while allowing computational errors, a capability that proves key to deriving scalable batch and incremental (online like) variants. A realistic feature of our framework is that it requires neither the computational errors to vanish in the limit nor its stepsizes to shrink to zero; such requirements are often imposed in standard analyses of (convex) incremental-gradient like methods [Bertsekas, 2010] or even in stochastic gradient methods [Gaivoronski, 1994].

NOCOPS builds on the remarkable work of Solodov [1997], but it is strictly more general than Solodov [1997] (which solves (1) only for $\psi \equiv 0$). Like the framework of Solodov [1997], NOCOPS also allows nonvanishing errors, which is practical, since often one has limited or no true control over computational errors (e.g., fixed noise level in a simulation). To our knowledge, ours is the first work on nonconvex proximal splitting that has both batch and incremental variants, regardless of the ability to handle nonvanishing errors.



Related Work. Among batch nonconvex splitting methods, an early paper is Fukushima and Mine [1981]. More recently, in his pioneering paper on composite minimization, Nesterov [2007] solved (1) via a splitting-like algorithm. Fukushima and Mine [1981] ensured convergence by forcing monotonic descent (using line-search); Nesterov [2007] proved convergence (for the nonconvex case) by also ensuring monotonic descent. Even more recently, Attouch et al. [2010] introduced a powerful method based on Kurdyka-Łojasiewicz theory, though convergence again hinged on descent. This insistence on monotonic descent makes these methods unsuitable for obtaining incremental, stochastic, or online variants.

But there are some incremental and stochastic methods that do apply to (1), namely the generalized gradient-type algorithms of Solodov and Zavriev [1998] and the stochastic generalized gradient methods of Ermoliev and Norkin [1997, 1998]. Both approaches are analogous to subgradient methods from convex optimization, and face similar difficulties. For example, as is well recognized (see Nesterov [2007], Duchi and Singer [2009], e.g.), subgradient-style methods fail to exploit composite objectives. Moreover, they exhibit the effect of the regularizer only in the limit; for example, if $\psi(x) = \|x\|_1$, then the sparse solutions are obtained only in the limit, and intermediate iterates may be dense.

For convex problems, a powerful alternative to subgradient methods is offered by proximal splitting (see Combettes and Pesquet [2010] for a survey). These methods split (1) into smooth and nonsmooth parts. The smooth part is handled as in gradient-projection while the nonsmooth part is handled via a proximity operator. Owing to their ability to effectively tackle the nonsmooth part, proximal methods have become valuable in machine learning and related areas; see [Combettes and Pesquet, 2010, Beck and Teboulle, 2009, Bach et al., 2011, Duchi and Singer, 2009] and the references therein.



2 The Nocops Framework 

We begin by defining the function $g : \mathbb{R}^n \to \mathbb{R}$ to be the sum $g(x) := \psi(x) + \delta(x|\mathcal{X})$, where $\delta(x|\mathcal{X})$ is the indicator function for the set $\mathcal{X}$. With this notation, the main problem of this paper is

$$\text{minimize}_{x \in \mathbb{R}^n} \quad \phi(x) := F(x) + g(x). \tag{3}$$

Next, we recall a definition central to our analysis. 

Definition 1 (Proximity operator). Let $g : \mathbb{R}^n \to \mathbb{R}$ be lsc and convex. The proximity operator for $g$, indexed by scalar $\eta > 0$, is the nonlinear map [see e.g., Rockafellar and Wets, 1998, Def. 1.22]:

$$P_\eta^g : y \mapsto \operatorname*{argmin}_{x \in \mathbb{R}^n} \left( g(x) + \frac{1}{2\eta}\|x - y\|^2 \right). \tag{4}$$
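
As a concrete illustration of Definition 1, the sketch below evaluates the proximity operator for the illustrative choice $g(x) = \lambda\|x\|_1$, for which (4) has the well-known closed form of coordinate-wise soft-thresholding. The function names and the parameter `lam` are assumptions made for this example and do not appear in the paper.

```python
import math


def soft_threshold(y, tau):
    """Coordinate-wise soft-thresholding: shrink each entry of y toward 0 by tau."""
    return [math.copysign(max(abs(v) - tau, 0.0), v) for v in y]


def prox_l1(y, eta, lam=1.0):
    """P_eta^g(y) for the assumed g(x) = lam * ||x||_1, i.e. the minimizer of
    g(x) + ||x - y||^2 / (2 * eta) over x."""
    return soft_threshold(y, eta * lam)


# Entries with magnitude below eta * lam are set to zero; the rest shrink.
print(prox_l1([3.0, -0.5, 1.2], eta=1.0))
```

The zeroed middle entry shows why proximal methods can produce sparse intermediate iterates, in contrast to subgradient-style updates.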



Proximity operators are key to forward-backward splitting [Combettes and Wajs, 2005], which, for convex $F \in C_L^1(\mathcal{X})$ and appropriate stepsizes $\eta_k$, optimizes (1) by essentially iterating

$$x^{k+1} = P_{\eta_k}^g\bigl(x^k - \eta_k \nabla F(x^k)\bigr), \quad k = 0, 1, \ldots. \tag{5}$$

Our new framework NOCOPS introduces two powerful generalizations to (5). First, it allows $F$ to be nonconvex, and second, it allows perturbations. Formally, NOCOPS performs the iteration

$$x^{k+1} = P_{\eta_k}^g\bigl(x^k - \eta_k \nabla F(x^k) + \eta_k \vartheta(x^k)\bigr), \quad k = 0, 1, \ldots, \tag{6}$$

where the stepsizes $\eta_k$ satisfy the standard bounds and conditions

$$c \le \liminf_k \eta_k, \quad \limsup_k \eta_k \le \min\{1, 2/L - c\}, \quad 0 < c < 1/L. \tag{7}$$

The perturbation term $\eta_k \vartheta(x^k)$ in (6) represents the computational errors, which occur for example when only an approximation to the full gradient $\nabla F(x^k)$ is available.

To make NOCOPS well-defined, following [Solodov, 1997], we also impose a mild restriction on the perturbations. Specifically, we assume that for all $\eta$ smaller than a fixed value $\bar{\eta}$, it holds that

$$\eta\|\vartheta(x)\| \le \bar{\epsilon}, \quad \text{for some fixed } \bar{\epsilon} \ge 0, \text{ and } \forall x \in \mathcal{X}. \tag{8}$$

Condition (8) is weaker than the typical vanishing error requirement $\eta\|\vartheta(x)\| \to 0$ imposed by most analyses. Since nonvanishing errors are allowed, exact stationary points cannot always be obtained, but appropriate inexact stationary points can be found. To formalize this, recall that a point $x^* \in \mathbb{R}^n$ is stationary for (3) if and only if it satisfies the inclusion

$$0 \in \partial_C \phi(x^*) := \nabla F(x^*) + \partial g(x^*), \tag{9}$$

where $\partial_C \phi$ denotes the Clarke (generalized) subdifferential [Clarke, 1983]. Inclusion (9) may be equivalently recast as the fixed-point condition

$$x^* = P_\eta^g\bigl(x^* - \eta \nabla F(x^*)\bigr), \quad \text{for } \eta > 0. \tag{10}$$

We use (10) to characterize approximate stationary points. Define thus the proximal residual

$$\rho(x) := x - P_1^g\bigl(x - \nabla F(x)\bigr), \tag{11}$$

so that for stationary $x^*$, the residual norm $\|\rho(x^*)\| = 0$. At a point $x$, let the level of perturbation be given by $\epsilon(x) \ge 0$. We define a point $x$ to be $\epsilon$-stationary if the residual norm satisfies

$$\|\rho(x)\| \le \epsilon(x). \tag{12}$$

To control the overall level of perturbation in the system, we require $\epsilon(x) \ge \eta\|\vartheta(x)\|$. Thus, intuitively, by letting $\eta$ become small enough, we can obtain a stationary point of any desired accuracy.
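
To make the perturbed iteration (6) and the proximal residual (11) concrete, here is a minimal sketch on a toy problem. Everything specific in it is an assumption chosen for illustration: the smooth nonconvex part $F(x) = \sum_i (x_i^2 - 1)^2/4$, the regularizer $g(x) = 0.1\|x\|_1$ plus the indicator of the box $[-2, 2]^n$, the bounded sinusoidal perturbation, and all function names. On this box the gradient of $F$ is Lipschitz with constant 11, so the stepsize 0.05 is below $1/L$.

```python
import math

LAM, LO, HI = 0.1, -2.0, 2.0  # assumed l1 weight and box bounds


def grad_F(x):
    # Gradient of the toy nonconvex F(x) = sum_i (x_i^2 - 1)^2 / 4.
    return [v ** 3 - v for v in x]


def prox_g(y, eta):
    # g is separable here, so its prox is soft-thresholding followed by
    # clipping to the box, applied coordinate by coordinate.
    out = []
    for v in y:
        s = math.copysign(max(abs(v) - eta * LAM, 0.0), v)
        out.append(min(max(s, LO), HI))
    return out


def nocops(x0, eta, num_iters, noise=0.0):
    # Iteration (6): x <- P_eta^g(x - eta * grad F(x) + eta * perturbation).
    x = list(x0)
    for k in range(num_iters):
        g = grad_F(x)
        # Hypothetical nonvanishing perturbation, bounded by `noise` per entry.
        theta = [noise * math.sin(k + i) for i in range(len(x))]
        x = prox_g([xi - eta * gi + eta * ti
                    for xi, gi, ti in zip(x, g, theta)], eta)
    return x


def residual_norm(x):
    # Norm of the proximal residual (11): rho(x) = x - P_1^g(x - grad F(x)).
    u = prox_g([xi - gi for xi, gi in zip(x, grad_F(x))], 1.0)
    return math.sqrt(sum((xi - ui) ** 2 for xi, ui in zip(x, u)))


x_exact = nocops([1.5, -0.3, 0.8], eta=0.05, num_iters=500)
x_noisy = nocops([1.5, -0.3, 0.8], eta=0.05, num_iters=500, noise=0.01)
print(residual_norm(x_exact), residual_norm(x_noisy))
```

In this toy run the noise-free iterates reach a point with essentially zero residual, while the perturbed iterates settle at a point whose residual is small but nonzero, which is the behavior that the notion of $\epsilon$-stationarity is meant to capture.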
2.1 Convergence analysis

Our analysis builds on the pioneering works of Nesterov [2007] and Solodov [1997]. But as mentioned, our problem and analysis are strictly more general. Specifically, in contrast to [Nesterov, 2007], we permit perturbations and do not rely on strict descent, and unlike [Solodov, 1997], we consider nonsmooth objective functions. Our analysis yields, to our knowledge, the first nonconvex proximal splitting algorithm with nonvanishing noise, and also the first nonconvex incremental proximal splitting algorithm, regardless of the vanishing or nonvanishing nature of the noise.

We begin by recalling two simple facts without proof; the first is a classical descent lemma. 



Lemma 1 (Descent lemma [see e.g., Nesterov, 2004, Lemma 2.1.3]). Let $F \in C_L^1(\mathcal{X})$. Then,

$$|F(x) - F(y) - \langle \nabla F(y), x - y \rangle| \le \tfrac{L}{2}\|x - y\|^2, \quad \forall\, x, y \in \mathcal{X}. \tag{13}$$

Lemma 2 (Nonexpansivity [see e.g., Combettes and Wajs, 2005, Lemma 2.4]). The operator $P_\eta^g$ is nonexpansive, that is,

$$\|P_\eta^g x - P_\eta^g y\| \le \|x - y\|, \quad \forall\, x, y \in \mathbb{R}^n. \tag{14}$$

Next, we prove a useful monotonicity result about proximity operators.

Lemma 3 (Monotonicity). Let $P_\eta \equiv P_\eta^g$; let $y, z \in \mathbb{R}^n$, and $\eta > 0$, and define

$$p(\eta) := \eta^{-1}\|P_\eta(y - \eta z) - y\|, \tag{15}$$

$$q(\eta) := \|P_\eta(y - \eta z) - y\|. \tag{16}$$

Then, $p(\eta)$ is a decreasing function of $\eta$, while $q(\eta)$ is an increasing function of $\eta$.
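
The monotonicity claimed in Lemma 3 can be checked numerically. The sketch below does so for the illustrative choice $g = \|\cdot\|_1$ (whose proximity operator is soft-thresholding) on a grid of stepsizes; the vectors `y`, `z`, the grid, and the function names are arbitrary assumptions made for the example.

```python
import math


def prox_l1(y, eta):
    # Proximity operator of the assumed g(x) = ||x||_1 with parameter eta.
    return [math.copysign(max(abs(v) - eta, 0.0), v) for v in y]


def q(eta, y, z):
    # q(eta) = || P_eta(y - eta * z) - y ||, as in (16).
    u = prox_l1([yi - eta * zi for yi, zi in zip(y, z)], eta)
    return math.sqrt(sum((ui - yi) ** 2 for ui, yi in zip(u, y)))


def p(eta, y, z):
    # p(eta) = q(eta) / eta, as in (15).
    return q(eta, y, z) / eta


y = [1.0, -2.0, 0.3]
z = [0.5, 1.5, -0.7]
etas = [0.1 * i for i in range(1, 31)]
p_vals = [p(e, y, z) for e in etas]
q_vals = [q(e, y, z) for e in etas]

# Monotonicity is checked in the non-strict sense, up to rounding error.
p_decreasing = all(a >= b - 1e-12 for a, b in zip(p_vals, p_vals[1:]))
q_increasing = all(a <= b + 1e-12 for a, b in zip(q_vals, q_vals[1:]))
print(p_decreasing, q_increasing)
```

A grid check of this kind is of course not a proof; it only illustrates the statement on one instance.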



Proof. Both (15) and (16) essentially follow from properties of Moreau envelopes [Rockafellar and Wets, 1998, Combettes and Wajs, 2005] (also see [Nesterov, 2007]). To set our notation, we provide a proof in the language of proximity operators. Consider thus the "deflected" proximity objective

$$m_g(x, \eta; y, z) := \langle z, x - y \rangle + \tfrac{1}{2}\eta^{-1}\|x - y\|^2 + g(x), \tag{17}$$

to which we associate the (deflected) Moreau envelope

$$E_g(\eta) := \inf_x\ m_g(x, \eta; y, z). \tag{18}$$

Function $m_g$ is easily seen to be convex (see e.g., [Rockafellar and Wets, 1998, Theorem 2.26]), and it is strongly convex in $x$; thus it follows that the infimum in (18) is attained at the unique point $P_\eta^g(y - \eta z)$. Thus, $E_g(\eta)$ is differentiable, and

$$E_g'(\eta) = -\tfrac{1}{2}\eta^{-2}\|P_\eta^g(y - \eta z) - y\|^2 = -\tfrac{1}{2}p(\eta)^2.$$

Since $E_g(\eta)$ is convex [Rockafellar and Wets, 1998, Theorem 2.26], $E_g'$ is increasing; equivalently $p(\eta)$ is decreasing. Similarly, note that $\hat{E}_g(\gamma) := E_g(1/\gamma)$ is concave in $\gamma$, as it is a pointwise (indexed by $x$) infimum of functions linear in $\gamma$ [Boyd and Vandenberghe, 2004, §3.2.3]. Thus, its derivative

$$\hat{E}_g'(\gamma) = \tfrac{1}{2}\|P_{1/\gamma}^g(y - \gamma^{-1}z) - y\|^2 = \tfrac{1}{2}q(1/\gamma)^2$$

is a decreasing function of $\gamma$; since $q \ge 0$, writing $\eta = 1/\gamma$ completes the argument, and proves (16). □



Remark 1. The monotonicity results (15) and (16) complement similar monotonicity results for projection operators derived in [Gafni and Bertsekas, 1984, Lemma 1].
Now we analyze the difference $\phi(x^k) - \phi(x^{k+1})$; specifically, we derive an inequality

$$\phi(x^k) - \phi(x^{k+1}) \ge h(x^k), \tag{19}$$

where the potential function $h(x)$ depends on $\|\rho(x)\|$ and $\epsilon(x)$. Note that the potential $h(x)$ is allowed to take on negative values because we do not insist on monotonic descent. To simplify notation, let $u \equiv x^{k+1}$, $x \equiv x^k$, and $\eta \equiv \eta_k$, so that update (6) becomes

$$u = P_\eta\bigl(x - \eta \nabla F(x) + \eta \vartheta(x)\bigr). \tag{20}$$

With this notation, we now have the following descent-like theorem.

Theorem 1. Let $x \in \mathcal{X}$, $u$, $\eta$ be as in (20), and $\epsilon(x) \ge \eta\|\vartheta(x)\|$. Then we have the bound

$$\phi(x) - \phi(u) \ge \frac{2 - L\eta}{2\eta}\|u - x\|^2 - \frac{1}{\eta}\epsilon(x)\|u - x\|. \tag{21}$$

Moreover, if $\phi'(u)$ is the subgradient $\phi'(u) := \nabla F(u) + s \in \partial_C \phi(u)$, then we also have the bound

$$\langle \phi'(u), x - u \rangle \ge \frac{1 - L\eta}{\eta}\|u - x\|^2 - \frac{1}{\eta}\epsilon(x)\|u - x\|. \tag{22}$$

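Proof. Consider the directional derivative $dm_g$ (of $m_g$ with respect to $x$, and in the direction $w$), which satisfies at $x = u$ the optimality condition

$$dm_g(u, \eta; y, z)(w) = \langle z + \eta^{-1}(u - y) + s, w \rangle \ge 0, \quad s \in \partial g(u). \tag{23}$$

In (23), set $z = \nabla F(x) - \vartheta(x)$, $y = x$, and $w = x - u$; then, rearrange to obtain

$$\langle \nabla F(x) - \vartheta(x), u - x \rangle \le \langle \eta^{-1}(u - x) + s, x - u \rangle. \tag{24}$$

By Lemma 1 we have

$$\phi(u) \le F(x) + \langle \nabla F(x), u - x \rangle + \tfrac{L}{2}\|u - x\|^2 + g(u). \tag{25}$$

Adding and subtracting $\vartheta(x)$ in (25), and then combining with (24) we further obtain

$$\begin{aligned}
\phi(u) &\le F(x) + \langle \nabla F(x) - \vartheta(x), u - x \rangle + \tfrac{L}{2}\|u - x\|^2 + g(u) + \langle \vartheta(x), u - x \rangle \\
&\le F(x) + \langle \eta^{-1}(u - x) + s, x - u \rangle + \tfrac{L}{2}\|u - x\|^2 + g(u) + \langle \vartheta(x), u - x \rangle \\
&= F(x) + g(u) + \langle s, x - u \rangle + \bigl(\tfrac{L}{2} - \tfrac{1}{\eta}\bigr)\|u - x\|^2 + \langle \vartheta(x), u - x \rangle \\
&\le F(x) + g(x) - \tfrac{2 - L\eta}{2\eta}\|u - x\|^2 + \langle \vartheta(x), u - x \rangle \\
&\le \phi(x) - \tfrac{2 - L\eta}{2\eta}\|u - x\|^2 + \|\vartheta(x)\|\|u - x\| \\
&\le \phi(x) - \tfrac{2 - L\eta}{2\eta}\|u - x\|^2 + \tfrac{1}{\eta}\epsilon(x)\|u - x\|,
\end{aligned}$$

where the third inequality follows from convexity of $g$, the fourth one from Cauchy-Schwarz, and the last one from the definition of $\epsilon(x)$. Flipping signs, we immediately obtain (21).

To prove (22), consider the directional derivative $d\phi(u; x - u)$, for which using the identity $\phi'(u) = \nabla F(u) + s$, we have

$$\begin{aligned}
\langle \nabla F(u) + s, x - u \rangle &= \langle \nabla F(x) - \vartheta(x) + s, x - u \rangle - \langle \nabla F(u) - \nabla F(x) + \vartheta(x), u - x \rangle \\
&\ge \langle \eta^{-1}(x - u), x - u \rangle - \langle \nabla F(u) - \nabla F(x), u - x \rangle - \langle \vartheta(x), u - x \rangle \\
&\ge (\eta^{-1} - L)\|u - x\|^2 - \langle \vartheta(x), u - x \rangle \\
&\ge \tfrac{1 - L\eta}{\eta}\|u - x\|^2 - \|\vartheta(x)\|\|u - x\| \\
&\ge \tfrac{1 - L\eta}{\eta}\|u - x\|^2 - \eta^{-1}\epsilon(x)\|u - x\|. \qquad \square
\end{aligned}$$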
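
The descent-like bound (21) can be sanity-checked numerically. The sketch below draws random points, perturbations, and stepsizes for an assumed toy instance ($F(x) = \sum_i (x_i^2-1)^2/4$ on the box $[-2,2]^3$, where the gradient is Lipschitz with $L = 11$, and $g = 0.1\|\cdot\|_1$ plus the box indicator), takes $\epsilon(x)$ equal to the stepsize times the perturbation norm, and compares both sides of (21). All names and constants are illustrative assumptions.

```python
import math
import random

L_CONST = 11.0                 # Lipschitz constant of grad F on [-2, 2]^n
LAM, LO, HI = 0.1, -2.0, 2.0   # assumed l1 weight and box bounds


def F(x):
    return sum((v * v - 1.0) ** 2 / 4.0 for v in x)


def grad_F(x):
    return [v ** 3 - v for v in x]


def g(x):
    # Value of g at points inside the box (the indicator term is 0 there).
    return LAM * sum(abs(v) for v in x)


def prox_g(y, eta):
    # Soft-thresholding followed by clipping (valid because g is separable).
    return [min(max(math.copysign(max(abs(v) - eta * LAM, 0.0), v), LO), HI)
            for v in y]


def norm(v):
    return math.sqrt(sum(t * t for t in v))


def both_sides_of_21(x, theta, eta):
    # u is the perturbed update (20); eps = eta * ||theta|| is the smallest
    # admissible perturbation level.
    u = prox_g([xi - eta * gi + eta * ti
                for xi, gi, ti in zip(x, grad_F(x), theta)], eta)
    d = norm([ui - xi for ui, xi in zip(u, x)])
    eps = eta * norm(theta)
    lhs = (F(x) + g(x)) - (F(u) + g(u))
    rhs = (2.0 - L_CONST * eta) / (2.0 * eta) * d * d - eps * d / eta
    return lhs, rhs


rng = random.Random(0)
results = []
for _ in range(200):
    x = [rng.uniform(LO, HI) for _ in range(3)]
    theta = [rng.uniform(-0.5, 0.5) for _ in range(3)]
    eta = rng.uniform(0.01, 0.09)
    results.append(both_sides_of_21(x, theta, eta))

all_hold = all(lhs >= rhs - 1e-9 for lhs, rhs in results)
print(all_hold)
```

Note that the right-hand side may be negative when the perturbation is large relative to the step, which is consistent with the analysis not insisting on monotonic descent.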
Next, we bound the right-hand side terms in (21), by deriving two-sided bounds on $\|x - u\|$.

Lemma 4. Let $x$, $u$, and $\eta$ be as in Theorem 1, and $c$ be as defined in (7). Then,

$$c\|\rho(x)\| - \eta^{-1}\epsilon(x) \le \|x - u\| \le \|\rho(x)\| + \eta^{-1}\epsilon(x). \tag{26}$$

Proof. From Lemma 3 (applied with $y = x$ and $z = \nabla F(x)$, so that $p(1) = q(1) = \|\rho(x)\|$) we see that for $\eta > 0$ it holds that

$$1 \le \eta \implies q(1) \le q(\eta), \quad \text{and} \quad 1 \ge \eta \implies p(1) \le p(\eta) = \eta^{-1}q(\eta). \tag{27}$$

Using (27), the triangle inequality, and nonexpansivity (Lemma 2), we see that

$$\begin{aligned}
\min\{1, \eta\}\, q(1) = \min\{1, \eta\}\|\rho(x)\| &\le \|P_\eta(x - \eta \nabla F(x)) - x\| \\
&\le \|x - u\| + \|u - P_\eta(x - \eta \nabla F(x))\| \\
&\le \|x - u\| + \|\vartheta(x)\| \le \|x - u\| + \eta^{-1}\epsilon(x).
\end{aligned}$$

Since $c \le \liminf_k \eta_k$, we have $\|x - u\| \ge c\|\rho(x)\| - \eta^{-1}\epsilon(x)$. An upper bound follows upon noting

$$\begin{aligned}
\|x - u\| &\le \|x - P_\eta(x - \eta \nabla F(x))\| + \|P_\eta(x - \eta \nabla F(x)) - u\| \\
&\le \max\{1, \eta\}\|\rho(x)\| + \|\vartheta(x)\| \le \|\rho(x)\| + \eta^{-1}\epsilon(x). \qquad \square
\end{aligned}$$

Theorem 1 and Lemma 4 have done the hard work; they imply the following crucial corollary.

Corollary 1. Let $x$, $u$, $\eta$, and $c$ be as in Lemma 4 and Theorem 1. Then, we have

$$\phi(x) - \phi(u) \ge h(x),$$

where the function $h$ is given by

h(x) := ^ c 2 \\p(x)\\ 2 - (c^ + i;)\\p(x)Mx) - + ^)e(x) 2 . (28) 

Proof. Plug in the bounds (26) into (21) and simplify. □

Now we are in a position to state our main convergence theorem. 

Theorem 2 (Convergence). Let $F \in C_L^1(\mathcal{X})$ be such that $\inf_{\mathcal{X}} F > -\infty$ and let $g$ be lsc, convex on $\mathcal{X}$. Let $\{x^k\} \subset \mathcal{X}$ be a sequence generated by (6), and let condition (8) hold. Then, there exists a limit point $x^*$ of the sequence $\{x^k\}$, and a constant $K > 0$, such that $\|\rho(x^*)\| \le K\epsilon(x^*)$. Moreover, if the sequence $\{F(x^k)\}$ converges, then for every limit point $x^*$ of $\{x^k\}$ it holds that $\|\rho(x^*)\| \le K\epsilon(x^*)$.

Proof. The hard work has been done by Theorem 1 and Lemma 4, because of which we have the bound $\phi(x) - \phi(u) \ge h(x)$, where the function $h(x)$ is defined as

h{x) := ^c 2 \\ P (x)f - {c 1 -^ + l)||p(z)||e(z) - + l)e{x)\ (29) 

Thus, Corollary 1 reduces the proof to a case where the arguments of [Solodov, 1997] become applicable, so we may conclude convergence; we omit details to avoid repetition. □






3 Incremental Nocops 



We now specialize NOCOPS to the large-scale setting with decomposable $F(x)$, that is, to

$$\text{minimize} \quad \phi(x) := \sum_{t=1}^T f_t(x) + g(x), \tag{30}$$

where each $f_t : \mathbb{R}^n \to \mathbb{R}$ is a $C_{L_t}^1(\mathcal{X})$ function (let $L_t \le L$ for simplicity), and $g$, $\mathcal{X}$ are as before.

In machine learning and optimization it has long been known that for decomposable objectives it can be advantageous to replace the full gradient $\nabla F(x)$ by an incremental gradient $\nabla f_{r(t)}(x)$, where $r(t)$ is some suitable index. Incremental methods have been extensively analyzed in the setting of backpropagation algorithms [Bertsekas, 2010, Solodov, 1997], a setting that corresponds to $g(x) \equiv 0$ in our case. For $g(x) \not\equiv 0$, the stochastic generalized gradient methods of [Ermoliev and Norkin, 1998] or the perturbed generalized methods of [Solodov and Zavriev, 1998] apply. But as previously argued, these methods suffer from problems similar to those faced by ordinary subgradient methods; so we may prefer proximal splitting methods instead.

To specialize NOCOPS for solving (30), we propose the following iteration:

$$\begin{aligned}
x^{k+1} &= \mathcal{M}\Bigl(x^k - \eta_k \sum\nolimits_{t=1}^T \nabla f_t(z^t)\Bigr), \\
z^1 &= x^k, \quad z^{t+1} = \mathcal{O}\bigl(z^t - \eta \nabla f_t(z^t)\bigr), \quad t = 1, \ldots, T-1.
\end{aligned} \tag{31}$$

Here, $\mathcal{O}$ and $\mathcal{M}$ are appropriate nonexpansive maps, choosing which we get different algorithms. For example, when $\mathcal{X} = \mathbb{R}^n$, $g(x) \equiv 0$, and $\mathcal{M} = \mathcal{O} = \mathrm{Id}$, then (31) reduces to the problem class considered in Solodov [1998]. If $\mathcal{X}$ is a closed convex set, $g(x) \equiv 0$, $\mathcal{M} = \Pi_{\mathcal{X}}$, and $\mathcal{O} = \mathrm{Id}$, then (31) reduces to a method that is essentially implicit in [Solodov, 1998]. Note, however, that in this case, the constraints are enforced only once every major iteration; the minor iterates $(z^t)$ may be infeasible.

We introduce below four variants of (31); to our knowledge, all four are novel.

1. $\mathcal{X} = \mathbb{R}^n$, $g(x) \not\equiv 0$, $\mathcal{M} = P_\eta^g$, and $\mathcal{O} = \mathrm{Id}$; this is a penalized unconstrained problem, and the penalty is applied once every major iteration.

2. $\mathcal{X} = \mathbb{R}^n$, $g(x) \not\equiv 0$, $\mathcal{M} = P_\eta^g$, and $\mathcal{O} = P_\eta^g$; this is a penalized unconstrained problem, but now the penalty is applied at every minor iteration.

3. $\mathcal{X}$ is a closed convex set, $g(x) = \psi(x) + \delta(x|\mathcal{X})$ (where $\psi$ may be zero or nonzero), $\mathcal{M} = P_\eta^g$, and $\mathcal{O} = \mathrm{Id}$; this is a penalized, constrained problem, and the penalty is applied once every major iteration.

4. Same as variant 3, except that $\mathcal{O} = P_\eta^g$; this is a penalized, constrained problem, and the penalty is applied at every minor iteration.

Which of the four variants one prefers depends on the complexity of the constraint set X and 
the regularizer g(x). However, the analysis of all four variants is similar, so we present details only 
for the fourth, as it is the most general. 
Input: $\{\nabla f_t(x)\}$, $P_\eta^g$: subroutines
Output: Approximate solution to (30)
$t \leftarrow 1$; $k \leftarrow 0$; $x^k \in \mathcal{X}$;
while not converged do
    $z^1 \leftarrow x^k$; $\alpha_1 \leftarrow$ …
    if $t < T$ then
        Get $t$-th gradient $g^t = \nabla f_t(z^t)$;
        Compute $z^{t+1} \leftarrow P_{\alpha_t}^g(z^t - \alpha_t g^t)$;
        Aggregate gradient $F^k \leftarrow F^k + g^t$;
        $t \leftarrow t + 1$; update stepsize $\alpha_t$;
    else
        $x^{k+1} = P_{\eta_k}^g(x^k - \eta_k F^k)$;
        $t \leftarrow 1$; $k \leftarrow k + 1$;
    Check for convergence;

Algorithm 1: Incremental NOCOPS.
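
The incremental iteration can be sketched in a few lines. The code below is a minimal illustration of the update (31) with both maps taken to be the proximity operator (penalty applied at every minor iteration), on an unconstrained toy instance; it is not the Matlab implementation used in the experiments. The component functions $f_t(x) = \tfrac{1}{2}(a_t^T x - b_t)^2$, the data, the weight `LAM`, the fixed stepsize, and all function names are assumptions chosen so that the result is easy to check by hand (the components are convex here only for checkability).

```python
import math

LAM = 0.01  # assumed l1 weight, g(x) = LAM * ||x||_1


def prox_l1(y, eta):
    # Proximity operator of g with parameter eta (soft-thresholding).
    return [math.copysign(max(abs(v) - eta * LAM, 0.0), v) for v in y]


def make_grad(a, b):
    # Gradient of the toy component f_t(x) = 0.5 * (a . x - b)^2.
    def grad(x):
        r = sum(ai * xi for ai, xi in zip(a, x)) - b
        return [r * ai for ai in a]
    return grad


def objective(x, data):
    fit = sum(0.5 * (sum(ai * xi for ai, xi in zip(a, x)) - b) ** 2
              for a, b in data)
    return fit + LAM * sum(abs(v) for v in x)


def incremental_nocops(x0, grads, prox, eta, num_major):
    x = list(x0)
    for _ in range(num_major):
        z = list(x)                    # z^1 = x^k
        agg = [0.0] * len(x)           # running sum of grad f_t(z^t)
        for t, grad in enumerate(grads):
            g = grad(z)
            agg = [a + gi for a, gi in zip(agg, g)]
            if t < len(grads) - 1:     # minor step producing z^{t+1}
                z = prox([zi - eta * gi for zi, gi in zip(z, g)], eta)
        # Major step: apply the prox to x^k minus the aggregated gradient.
        x = prox([xi - eta * ai for xi, ai in zip(x, agg)], eta)
    return x


data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0), ([1.0, 1.0], 3.0)]
grads = [make_grad(a, b) for a, b in data]
x0 = [0.0, 0.0]
x_final = incremental_nocops(x0, grads, prox_l1, eta=0.1, num_major=200)
print(objective(x0, data), objective(x_final, data))
```

Switching the minor step to the identity map (dropping the inner `prox` call) gives the variant in which the penalty is applied only once per major iteration.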



3.1 Convergence 

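We begin by rewriting (31) in a form that matches the main iteration (6):

$$x^{k+1} = \mathcal{M}\Bigl(x^k - \eta_k \sum\nolimits_{t=1}^T \nabla f_t(z^t)\Bigr) = \mathcal{M}\bigl(x^k - \eta_k \nabla F(x^k) + \eta_k \vartheta(x^k)\bigr),$$

where the error term at a general $x$ is given by $\vartheta(x) := \sum_{t=1}^T \bigl(\nabla f_t(x) - \nabla f_t(z^t)\bigr)$. We must ensure that the norm of the error term is bounded. Lemma 7 proves this bound; however, we first prove two helpful lemmas.

Lemma 5 (Bounded increment). Let $z^{t+1}$ be computed by (31). Then, we have

$$\text{if } \mathcal{O} = \mathrm{Id}, \text{ then } \|z^{t+1} - z^t\| = \eta\|\nabla f_t(z^t)\|, \tag{32}$$

$$\text{if } \mathcal{O} = \Pi_{\mathcal{X}}, \text{ then } \|z^{t+1} - z^t\| \le \eta\|\nabla f_t(z^t)\|, \tag{33}$$

$$\text{if } \mathcal{O} = P_\eta^g \text{ and } s^t \in \partial g(z^t), \text{ then } \|z^{t+1} - z^t\| \le 2\eta\|\nabla f_t(z^t) + s^t\|. \tag{34}$$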

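Proof. Relation (32) is obvious and (33) follows immediately from nonexpansivity of projections. For proving (34), notice that definition (4) implies the inequality

$$\tfrac{1}{2}\|z^{t+1} - z^t + \eta \nabla f_t(z^t)\|^2 + \eta g(z^{t+1}) \le \tfrac{1}{2}\|\eta \nabla f_t(z^t)\|^2 + \eta g(z^t),$$

$$\tfrac{1}{2}\|z^{t+1} - z^t\|^2 \le \eta \langle \nabla f_t(z^t), z^t - z^{t+1} \rangle + \eta\bigl(g(z^t) - g(z^{t+1})\bigr).$$

But since $g$ is convex, we know that

$$g(z^{t+1}) \ge g(z^t) + \langle s^t, z^{t+1} - z^t \rangle, \quad s^t \in \partial g(z^t).$$

Thus it further follows that

$$\begin{aligned}
\tfrac{1}{2}\|z^{t+1} - z^t\|^2 &\le \eta \langle s^t, z^t - z^{t+1} \rangle + \eta \langle \nabla f_t(z^t), z^t - z^{t+1} \rangle \\
&\le \eta\|s^t + \nabla f_t(z^t)\|\|z^t - z^{t+1}\| \\
\implies \|z^{t+1} - z^t\| &\le 2\eta\|\nabla f_t(z^t) + s^t\|. \qquad \square
\end{aligned}$$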
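Lemma 6 (Incrementality error). Let $x \equiv x^k$, and define

$$\epsilon_t := \|\nabla f_t(z^t) - \nabla f_t(x)\|, \quad t = 1, \ldots, T. \tag{35}$$

Then, for each $t \ge 2$, the following bound on the error holds:

$$\epsilon_t \le 2\eta L \sum_{j=1}^{t-1} (1 + 2\eta L)^{t-1-j}\|\nabla f_j(x) + s^j\|, \quad t = 2, \ldots, T. \tag{36}$$

Proof. The proof extends the unconstrained, unpenalized setting of [Solodov, 1998] to our setting. We proceed by induction. The base case is $t = 2$, for which we have

$$\epsilon_2 = \|\nabla f_2(z^2) - \nabla f_2(x)\| \le L\|z^2 - x\| = L\|z^2 - z^1\| \le 2\eta L\|\nabla f_1(x) + s^1\|.$$

Assume inductively that (36) holds for $t \le r < T$, and consider $t = r + 1$. In this case we have

$$\begin{aligned}
\epsilon_{r+1} = \|\nabla f_{r+1}(z^{r+1}) - \nabla f_{r+1}(x)\| &\le L\|z^{r+1} - x\| = L\Bigl\|\sum\nolimits_{j=1}^r (z^{j+1} - z^j)\Bigr\| \\
&\le L \sum\nolimits_{j=1}^r \|z^{j+1} - z^j\| \overset{\text{Lemma 5}}{\le} 2\eta L \sum\nolimits_{j=1}^r \|\nabla f_j(z^j) + s^j\|.
\end{aligned} \tag{37}$$

To complete the induction, first observe that $\|\nabla f_t(z^t) + s^t\| \le \|\nabla f_t(x) + s^t\| + \epsilon_t$. Thus, invoking the induction hypothesis, we obtain

$$\|\nabla f_t(z^t) + s^t\| \le \|\nabla f_t(x) + s^t\| + 2\eta L \sum_{j=1}^{t-1} (1 + 2\eta L)^{t-1-j}\|\nabla f_j(x) + s^j\|, \quad t = 2, \ldots, r. \tag{38}$$

Combining inequality (38) with (37) we further obtain

$$\epsilon_{r+1} \le 2\eta L \sum_{t=1}^r \Bigl( \|\nabla f_t(x) + s^t\| + 2\eta L \sum_{j=1}^{t-1} (1 + 2\eta L)^{t-1-j}\|\nabla f_j(x) + s^j\| \Bigr).$$

Introducing the shorthand $\beta_j \equiv \|\nabla f_j(x) + s^j\|$, simple manipulation of the above inequality yields

$$\begin{aligned}
\epsilon_{r+1} &\le 2\eta L \beta_r + \sum_{l=1}^{r-1} \Bigl( 2\eta L + 4\eta^2 L^2 \sum_{j=l+1}^{r} (1 + 2\eta L)^{j-l-1} \Bigr) \beta_l \\
&= 2\eta L \beta_r + \sum_{l=1}^{r-1} \Bigl( 2\eta L + 4\eta^2 L^2 \sum_{j=0}^{r-l-1} (1 + 2\eta L)^{j} \Bigr) \beta_l \\
&= 2\eta L \beta_r + \sum_{l=1}^{r-1} 2\eta L (1 + 2\eta L)^{r-l} \beta_l = 2\eta L \sum_{l=1}^{r} (1 + 2\eta L)^{r-l} \beta_l,
\end{aligned}$$

which completes the proof. □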

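Now we are ready to bound the error using Lemma 6.

Lemma 7 (Bounded error). If for all $x \in \mathcal{X}$, $\|\nabla f_t(x)\| \le M$ and $\|\partial g(x)\| \le G$, then $\|\vartheta(x)\| \le K_1$, for some constant $K_1 > 0$.

Proof. First, observe that if $z^{t+1}$ is computed by (31), $\mathcal{O} = P_\eta^g$, and $s^t \in \partial g(z^t)$, then

$$\|z^{t+1} - z^t\| \le 2\eta\|\nabla f_t(z^t) + s^t\|. \tag{39}$$

Using (39) we can bound the error incurred upon using $z^t$ instead of $x$. Specifically, if $x \equiv x^k$, and

$$\epsilon_t := \|\nabla f_t(z^t) - \nabla f_t(x)\|, \quad t = 1, \ldots, T, \tag{40}$$

then Lemma 6 shows the following bound:

$$\epsilon_t \le 2\eta L \sum_{j=1}^{t-1} (1 + 2\eta L)^{t-1-j}\|\nabla f_j(x) + s^j\|, \quad t = 2, \ldots, T. \tag{41}$$

Since $\epsilon_1 = 0$, we have, with $\beta_t \equiv \|\nabla f_t(x) + s^t\|$,

$$\begin{aligned}
\|\vartheta(x)\| \le \sum_{t=2}^T \epsilon_t &\le 2\eta L \sum_{t=2}^T \sum_{j=1}^{t-1} (1 + 2\eta L)^{t-1-j} \beta_j \\
&\le \sum_{t=1}^{T-1} (1 + 2\eta L)^{T-t} \beta_t \le (1 + 2\eta L)^{T-1} \sum_{t=1}^{T-1} \|\nabla f_t(x) + s^t\| \\
&\le C_1 (T - 1)(M + G) =: K_1,
\end{aligned}$$

where $C_1 = (1 + 2\eta L)^{T-1}$. □

Remark 2. Lemma 7 implies that if in the error condition (8), we let $\eta \to 0$, then $\eta\|\vartheta(x)\| \to 0$.

Given the error bounds established by Lemma 7, convergence results for Algorithm 1 follow immediately from the more general Theorem 2; we omit details to avoid repetition.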



4 Application 

The main contribution and focus of this paper is the new NOCOPS framework, and studying a specific application is not one of the aims of this paper. Nevertheless, we do illustrate NOCOPS's empirical performance on a challenging nonconvex problem, namely, (penalized) nonnegative matrix factorization:

$$\min_{X, A \ge 0} \quad \tfrac{1}{2}\|Y - XA\|_F^2 + \psi_0(X) + \sum_{t=1}^T \psi_t(a_t), \tag{42}$$

where $Y$ is an $m \times T$ matrix, $X$ is $m \times K$, and $A$ is $K \times T$ with $a_1, \ldots, a_T$ as its columns. Problem (42) extends the famous nonnegative matrix factorization (NMF) problem [Lee and Seung, 2000] by allowing $Y$ to be arbitrary (not necessarily nonnegative) and adding nonsmooth regularizers on $X$ and $A$.

A similar class of problems was recently also studied in [Mairal et al., 2010], but with a crucial difference: the formulation in [Mairal et al., 2010] does not allow nonsmooth regularizers on $X$ (the class of problems studied in [Mairal et al., 2010] is a subset of those our framework allows). On a technical note, [Mairal et al., 2010] consider stochastic optimization methods whose analysis requires perturbations to disappear in the limit, while our method is deterministic and our analysis does not rely on disappearing perturbations.

Following [Mairal et al., 2010] we rewrite (42) in a form more amenable to NOCOPS, that is,

$$\min_X \quad \phi(X) := \sum_{t=1}^T f_t(X) + g(X), \tag{43}$$

where $g(X)$ captures both $\psi_0(X)$ and the constraints on $X$. Each $f_t(X)$ is defined as

$$f_t(X) := \min_a \quad \tfrac{1}{2}\|y_t - Xa\|^2 + g_t(a), \quad 1 \le t \le T, \tag{44}$$
where $g_t(a)$ captures both $\psi_t(a)$ and the constraints on $a_t$. Whenever (44) attains its unique minimum, say $a^*$, then $f_t(X)$ is differentiable and we have $\nabla_X f_t(X) = (Xa^* - y_t)(a^*)^T$. Thus, we can instantiate Algorithm 1; all we need is a subroutine for solving (44).¹

In our experiments, we consider the following two instances of (43): (i) $g(X) = \delta(X|{\ge 0})$; and (ii) $g(X) = \lambda\|X\|_1 + \delta(X|{\ge 0})$. We select the $g_t$'s to be matching, so that $g_t(a) = \delta(a|{\ge 0})$, and $g_t(a) = \gamma\|a\|_1 + \delta(a|{\ge 0})$ are used. Choice (i) solves standard NMF, while choice (ii) solves a sparse-NMF problem. We remark that in general, when penalizing $X$ one should either constrain $a_t$ or penalize it, otherwise one can get degenerate solutions.
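
The gradient of a single component $f_t$ can be sketched as follows for choice (i), where the inner problem is a nonnegative least-squares problem. The inner solver here is a plain projected-gradient loop chosen for brevity; it is an assumption of this example and not the subroutine used in the experiments. The function names, the iteration count, and the tiny test matrix are likewise illustrative.

```python
def matvec(X, a):
    # Matrix-vector product X a, with X stored as a list of rows.
    return [sum(xij * aj for xij, aj in zip(row, a)) for row in X]


def solve_nonneg_ls(X, y, num_iters=500):
    # Projected gradient for min_{a >= 0} 0.5 * ||y - X a||^2.
    # The step 1 / ||X||_F^2 is a conservative choice, since ||X||_F^2
    # upper-bounds the Lipschitz constant of the gradient.
    m, K = len(X), len(X[0])
    step = 1.0 / max(sum(v * v for row in X for v in row), 1e-12)
    a = [0.0] * K
    for _ in range(num_iters):
        r = [p - yi for p, yi in zip(matvec(X, a), y)]            # X a - y
        grad = [sum(X[i][j] * r[i] for i in range(m)) for j in range(K)]
        a = [max(aj - step * gj, 0.0) for aj, gj in zip(a, grad)]
    return a


def grad_f_t(X, y):
    # Returns the gradient (X a* - y)(a*)^T with respect to X, and a*.
    a = solve_nonneg_ls(X, y)
    r = [p - yi for p, yi in zip(matvec(X, a), y)]
    return [[ri * aj for aj in a] for ri in r], a


X = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, -1.0]
G, a_star = grad_f_t(X, y)
print(a_star, G)
```

For this identity-matrix instance the inner minimizer is simply the nonnegative part of `y`, which makes the resulting gradient easy to verify by hand. The formula is only valid when the inner minimizer is unique, as noted in the text.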

To provide the reader with a baseline, on the basic NMF problem we compare NOCOPS against the well-tuned C++ toolbox SPAMS [Mairal et al., 2010]. Obviously, the comparisons are not fair to NOCOPS, because unlike SPAMS, it is implemented in Matlab. Fortunately, our Matlab implementation already runs very competitively, and unlike SPAMS, also allows factorizing sparse matrices. We note that since our subroutines depend heavily on matrix-vector operations, a well-tuned C++ implementation of NOCOPS should run at least 3-4 times faster (based on initial experiments) than our Matlab version, especially for sparse matrices.

Remark: The speed comparisons should not distract the reader from the main message: NOCOPS offers a new proximal operator based optimization framework that runs very competitively. With careful implementation, one can expect to see huge gains in speed, but engineering such a careful implementation is not one of the main points of this paper.

We compute NMF on the following data matrices: 



1. CBCL Face Database [Sung, 1996] (dense, size 361 × 2429); we compute a rank-49 factorization.

2. Yale B Database [Lee et al., 2005] (dense, size 32256 × 2414); we compute a rank-64 factorization.

3. Random matrix (dense, size 4000 × 4000, entries in [0, 1]); we compute a rank-64 factorization, with $X$ penalized by $\lambda\|\cdot\|_1$, and $A$ by $\gamma\|\cdot\|_1$, with $(\lambda, \gamma) = (10^2, 10^{-4})$.

4. Pajek connectivity matrix for Internet routers (sparse, size 124,651 × 124,651, density $1.3 \cdot 10^{-5}$, from the UFL sparse matrix collection, ID: 1505 [Davis and Hu, 2011]); we compute a rank-4 factorization; here $(\lambda, \gamma)$ = (10 , 10 ) were used.

Figure 1 summarizes our experimental results. In the first row, in addition to SPAMS, we include running times for Lee and Seung's algorithm, and our implementation of alternating (nonnegatively constrained) least squares. From the graph we see that our Matlab implementation of NOCOPS runs only slightly slower than the state-of-the-art method in SPAMS. The plots also show a dashed line that hints at what might be achievable with a faster C++ implementation of NOCOPS.

The second row in Figure 1 shows numerical results that compare the stochastic generalized gradient (SGGD) algorithm of [Ermoliev and Norkin, 1998] against NOCOPS, when started at exactly the same point. As is well-known, SGGD requires careful stepsize tuning to be competitive. Thus, we searched over a range of possible stepsize choices, and report the results with the best choices found. NOCOPS also requires some stepsize tuning, but significantly less than SGGD. Finally, we note that as predicted, the solutions returned by NOCOPS often have objective function values better than those of SGGD, and always achieved greater sparsity.

[Figure 1 panels: CBCL, Yale]

Figure 1: Top row: running times on CBCL and Yale B data. Bottom row: SGGD against NOCOPS. The densities of solutions returned (left to right) were (100%, 58.8%) for SGGD, and (61.1%, 0.1%) for NOCOPS. Initial objective values and very small runtimes have been suppressed for clarity of presentation.

1 In practice, it is better to use mini-batches, and we used them for all the online algorithms compared.

5 Discussion

We presented a new framework called NOCOPS, which solves a broad class of nonconvex composite objective problems. NOCOPS builds on the general analysis of [Solodov, 1997], and extends it to admit problems that are strictly more general. NOCOPS permits nonvanishing perturbations, which is a useful practical feature. We exploited the perturbation analysis to derive both batch and incremental versions of NOCOPS. Finally, experiments with medium to large matrices showed that NOCOPS is competitive with state-of-the-art methods; NOCOPS was also seen to outperform the stochastic generalized gradient method.

We conclude by mentioning that NOCOPS includes numerous algorithms and problem settings as special cases. Examples are: forward-backward splitting with convex costs, incremental forward-backward splitting (convex), gradient projection (both convex and nonconvex), the proximal-point algorithm, and so on. Thus, it will be valuable to investigate if some of the theoretical results for these methods can be carried over to NOCOPS. Theoretically, the most important open problem that we would like to analyze is to permit even the regularizer in (1) to be nonconvex, but this might require significantly different convergence analysis.
A Implementation notes

If $\mathcal{X}$ is the nonnegative orthant $\mathbb{R}_+^n$, then the proximity operator $P_\eta^g$ often simplifies as

$$P_\eta^g(y) = P_\eta^\psi(\Pi_{\mathcal{X}}\, y). \tag{45}$$

Additionally, if $\psi$ is an elementwise separable function, then one can easily admit a box-plus-hyperplane constraint set $\mathcal{X}$ of the form

$$\mathcal{X} = \{x \in \mathbb{R}^n \mid l_i \le x_i \le u_i, \text{ for } 1 \le i \le n, \text{ and } a^T x = b\}. \tag{46}$$

For more general constraint sets, we can invoke Dykstra splitting [Combettes and Pesquet, 2010], which solves the problem

$$\min_x \quad \tfrac{1}{2}\|x - y\|^2 + \psi(x) + \delta(x|\mathcal{X}), \tag{47}$$

by using the following algorithm

Dykstra splitting for (47)
    Initialize $x \leftarrow y$, $p \leftarrow 0$, $q \leftarrow 0$
    While not converged, iterate:
        $e \leftarrow P^\psi(x + p)$
        $p \leftarrow x + p - e$
        $x \leftarrow \Pi_{\mathcal{X}}(e + q)$
        $q \leftarrow e + q - x$.        (48)

It can be shown that iterating (48) converges to the solution of (47). In practice, it usually suffices to run Dykstra splitting for a few iterations (2-10) only.
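
The Dykstra splitting loop can be sketched directly. The example below assumes $\psi = \lambda\|\cdot\|_1$ and takes the constraint set to be a box, a case where the answer is also available in closed form (soft-threshold, then clip), so the output can be checked. The function names, the default iteration count, and the test data are assumptions made for illustration.

```python
import math


def prox_l1(v, lam):
    # Proximity operator of lam * ||.||_1 (soft-thresholding).
    return [math.copysign(max(abs(t) - lam, 0.0), t) for t in v]


def project_box(v, lo, hi):
    # Euclidean projection onto the box [lo, hi]^n.
    return [min(max(t, lo), hi) for t in v]


def dykstra_prox(y, lam, lo, hi, num_iters=10):
    # Approximately solves min_x 0.5 * ||x - y||^2 + lam * ||x||_1
    # subject to lo <= x_i <= hi, by alternating the two simple operators
    # with correction terms p and q.
    n = len(y)
    x, p, q = list(y), [0.0] * n, [0.0] * n
    for _ in range(num_iters):
        e = prox_l1([xi + pi for xi, pi in zip(x, p)], lam)
        p = [xi + pi - ei for xi, pi, ei in zip(x, p, e)]
        x = project_box([ei + qi for ei, qi in zip(e, q)], lo, hi)
        q = [ei + qi - xi for ei, qi, xi in zip(e, q, x)]
    return x


x_dyk = dykstra_prox([3.0, -0.2, 0.9], lam=0.5, lo=0.0, hi=2.0)
print(x_dyk)
```

For constraint sets without a closed-form combined operator, only `project_box` would need to be swapped for the appropriate projection.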